Multiplicative Weights Update with Constant Step-Size in Congestion Games: Convergence, Limit Cycles and Chaos
The Multiplicative Weights Update (MWU) method is a ubiquitous meta-algorithm that works as follows: a distribution is maintained on a certain set; at each step the probability assigned to action $\gamma$ is multiplied by $(1 - \epsilon C(\gamma)) > 0$, where $C(\gamma)$ is the ``cost'' of action $\gamma$, and the values are then rescaled to ensure that they again form a distribution. We analyze MWU in congestion games where agents use \textit{arbitrary admissible constants} as learning rates $\epsilon$ and prove convergence to \textit{exact Nash equilibria}. Interestingly, this convergence result does not carry over to the nearly homologous MWU variant in which, at each step, the probability assigned to action $\gamma$ is instead multiplied by $(1 - \epsilon)^{C(\gamma)}$: even in the simplest case of two-agent, two-strategy load balancing games, these dynamics can provably lead to limit cycles or even chaotic behavior.
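The linear update rule described in the abstract can be sketched in the two-agent, two-strategy load balancing setting it mentions. This is a minimal illustration, not the paper's analysis: the cost normalization (expected loads divided by 2 so that $1 - \epsilon C$ stays positive) and the learning rate value are illustrative assumptions.

```python
# Minimal sketch of the linear MWU rule in a two-agent, two-strategy
# load balancing game (machines A and B). A player's cost is the load
# of its chosen machine (1 if alone, 2 if shared); we use expected
# costs and normalize them into (0, 1] so that 1 - eps*cost > 0.

def mwu_step(p, q, eps=0.1):
    """One synchronous linear-MWU step.

    p, q: probability that agent 1 (resp. agent 2) picks machine A.
    Returns the updated (p, q).
    """
    # Expected normalized cost of each machine for each agent.
    cost_a1 = (1 + q) / 2.0        # agent 1 on A shares it w.p. q
    cost_b1 = (2 - q) / 2.0        # agent 1 on B shares it w.p. 1 - q
    cost_a2 = (1 + p) / 2.0
    cost_b2 = (2 - p) / 2.0

    # Linear MWU: multiply each probability by (1 - eps * cost),
    # then rescale so the weights form a distribution again.
    wa1, wb1 = p * (1 - eps * cost_a1), (1 - p) * (1 - eps * cost_b1)
    wa2, wb2 = q * (1 - eps * cost_a2), (1 - q) * (1 - eps * cost_b2)
    return wa1 / (wa1 + wb1), wa2 / (wa2 + wb2)

p, q = 0.9, 0.2
for _ in range(2000):
    p, q = mwu_step(p, q)
# The dynamics drift toward the pure equilibrium (agent 1 on A,
# agent 2 on B), consistent with the convergence result for the
# linear variant.
```

Replacing the factor $(1 - \epsilon C)$ with $(1 - \epsilon)^{C}$ in `mwu_step` yields the exponential variant whose trajectories, per the abstract, can cycle or behave chaotically instead of converging.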
Log-Normal Multiplicative Dynamics for Stable Low-Precision Training of Large Networks
Nishida, Keigo, Kıral, Eren Mehmet, Bannai, Kenichi, Khan, Mohammad Emtiyaz, Möllenhoff, Thomas
Studies in neuroscience have shown that biological synapses follow a log-normal distribution whose evolution can be explained by noisy multiplicative dynamics. Biological networks can function stably even under dynamically fluctuating conditions arising from unreliable synaptic transmission. Here we ask: is it possible to design similar multiplicative training in artificial neural networks? To answer this question, we derive a Bayesian learning rule that assumes log-normal posterior distributions over weights, which gives rise to a new Log-Normal Multiplicative Dynamics (LMD) algorithm. The algorithm uses multiplicative updates, with both noise and regularization applied multiplicatively. The method is as easy to implement as Adam and requires storing only one additional vector. Our results show that LMD achieves stable and accurate training from scratch under low-precision forward operations for Vision Transformer and GPT-2. These results suggest that multiplicative dynamics, a biological feature, may enable stable low-precision inference and learning on future energy-efficient hardware.
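The log-normality the abstract starts from has a simple mechanistic explanation worth making concrete: a quantity updated by i.i.d. positive multiplicative noise factors becomes log-normal, by the central limit theorem in log-space. The sketch below only illustrates this distributional claim; it is not the LMD training algorithm itself, and the noise scale 0.05 is an arbitrary assumption.

```python
import math
import random

# Multiplicative noise makes weights log-normal: after many steps,
# log(w) is a sum of i.i.d. Gaussian increments, hence approximately
# Normal(0, n_steps * sigma^2), so w itself is log-normally distributed.
random.seed(1)
n_weights, n_steps, sigma = 10_000, 200, 0.05
logs = []
for _ in range(n_weights):
    w = 1.0
    for _ in range(n_steps):
        w *= math.exp(random.gauss(0.0, sigma))  # multiplicative noise
    logs.append(math.log(w))

mean = sum(logs) / len(logs)
var = sum((l - mean) ** 2 for l in logs) / len(logs)
# mean of log(w) should be near 0; its variance near n_steps * sigma^2 = 0.5
```

Note that the same noise applied additively (`w += random.gauss(0, sigma)`) would yield a Gaussian, not log-normal, stationary shape, which is why the multiplicative structure matters for matching the synaptic-weight statistics.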
Langevin Multiplicative Weights Update with Applications in Polynomial Portfolio Management
Feng, Yi, Wang, Xiao, Xie, Tian
We consider nonconvex optimization problems over the simplex and, more generally, over a product of simplices. We provide an algorithm, Langevin Multiplicative Weights Update (LMWU), for solving global optimization problems by adding noise that scales with the non-Euclidean geometry of the simplex. Non-convex optimization has been extensively studied by the machine learning community due to its applications in scenarios such as neural network approximation and finding Nash equilibria. Despite recent progress on provable guarantees for escaping and avoiding saddle points (convergence to local minima) and on the global convergence of unconstrained Langevin gradient-based methods, global optimization with constraints is less studied. We show that the LMWU algorithm provably converges to interior global minima, with a non-asymptotic convergence analysis. We verify the efficiency of the proposed algorithm on a real data set from polynomial portfolio management, where optimization of a highly non-linear objective function plays a crucial role.
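The abstract does not spell out the update rule, so the following is only a hedged sketch of what a Langevin-style multiplicative weights step on the simplex could look like: an exponentiated-gradient update with Gaussian noise injected in log-space. The `sqrt(2*eta/beta)` noise scale is the standard Langevin choice, not necessarily the paper's geometry-aware scaling, and the test objective is a toy example.

```python
import math
import random

def langevin_mwu_step(x, grad, eta=0.01, beta=100.0, rng=random):
    """One noisy exponentiated-gradient step on the simplex (sketch).

    x: point on the simplex; grad: gradient of the objective at x.
    """
    logits = [math.log(xi) - eta * gi + math.sqrt(2 * eta / beta) * rng.gauss(0, 1)
              for xi, gi in zip(x, grad)]
    m = max(logits)                      # subtract max for numerical stability
    w = [math.exp(l - m) for l in logits]
    s = sum(w)
    return [wi / s for wi in w]          # renormalize back onto the simplex

# Toy nonconvex (concave) objective f(x) = -sum(x_i^2): its minima are
# the simplex vertices, while the uniform point is an interior critical
# point where the noiseless dynamics would stall by symmetry.
def grad_f(x):
    return [-2 * xi for xi in x]

random.seed(0)
x = [1 / 3, 1 / 3, 1 / 3]
for _ in range(5000):
    x = langevin_mwu_step(x, grad_f(x))
# The noise breaks the symmetry and the iterate escapes toward a vertex.
```

The role of the noise here mirrors the abstract's motivation: without it, the deterministic multiplicative update started at the uniform point never moves, whereas the perturbed dynamics escape the saddle-like symmetric point.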
Review for NeurIPS paper: Learning compositional functions via multiplicative weight updates
Weaknesses: I was not totally convinced by the experiments section, and have questions about that section and some more general questions which the authors might address: 1. The way that Figure 1 is laid out suggests that it is appropriate to compare the three algorithms over the same set of values of eta. Can the authors justify this? It seems to me that the meaning of eta in the Madam algorithm is different from its meaning in SGD and Adam (it is effectively a coincidence that these different hyperparameters share a name). What happens if you evaluate Madam over a denser grid of eta values and then zoom in on the x-axis of the left-hand plot? 2. The value reported for the transformer on the wikitext-2 task, for SGD and Madam, seems very high. Perhaps the authors are using a different unit of measurement?
Review for NeurIPS paper: Learning compositional functions via multiplicative weight updates
This is a good paper which combines insights from optimization, hardware, and neuroscience to give a multiplicative weight update for neural nets. It seems worthwhile to try out multiplicative updates in the context of modern architectures, and this paper seems to have made them competitive with existing optimizers, in a way that allows lower-precision computation (as low as 8 bits). As far as I can tell, there isn't a clear advantage for current hardware, but this serves as a good proof of concept that could help inform future hardware design. While no individual insight is particularly deep, everything is combined in an interesting and cohesive way, so the reviewers and I think this paper is definitely above the bar for acceptance. I encourage the authors to account for the reviewers' feedback in the camera-ready version.